Abstract. We present an experimental study of large-scale RDF federations on
top of the Bio2RDF data sources, involving 29 data sets with more than four
billion RDF triples deployed in a local federation. Our federation is driven by
FedX, a highly optimized federation mediator for Linked Data. We discuss design
decisions, technical aspects, and experiences gained in setting up and
optimizing the Bio2RDF federation, and present an exhaustive experimental
evaluation of the federation scenario. In addition to a controlled setting with
local federation members, we study implications arising in a hybrid setting,
where local federation members interact with remote federation members
exhibiting higher network latency. The outcome demonstrates the feasibility of
federated semantic data management in general and indicates remaining
bottlenecks and research opportunities that can serve as a guideline for future
work in the area of federated semantic data processing.



1 Introduction 

The vision of the Semantic Web, i.e. transforming the current Web of Documents
into a Web of Data, has been gaining more and more attention lately. Connecting
not only documents on the Web but also establishing connections on the data
level opens up new possibilities of automatic interaction, knowledge
representation, question answering, and knowledge acquisition that have not
been available before. In particular, the Linked Open Data [2] community has
been working on providing links between RDF data on the Web, making RDF and
SPARQL the popular standards for data representation and querying on the
Semantic Web.

The Linked Open Data cloud now consists of 295 data sources and about 31 
billion RDF triples - and is constantly growing. One of the core principles of 
Linked Data is to use Uniform Resource Identifiers (URIs) as unique identifiers 
that globally represent a specific entity and can be used across data sources 
to interlink resources. As the data provided on the Web and by each source
is rapidly outgrowing the capacity of purely explorative querying (DBpedia,
for instance, now has about 1 billion triples), some sources provide their data
collections for download as RDF dumps or enable access via SPARQL endpoints.
Accessing a data set through its SPARQL endpoint has two major advantages
over downloading RDF dumps. First, it allows complex queries to be evaluated
over the data set without the need to set up a private triple store, possibly
even on expensive high-end hardware. Second, data behind SPARQL endpoints is
often more up-to-date than the available dumps (which may be updated only at
long intervals and therefore not include recent updates to the data set).

For queries that involve multiple data sets, connecting multiple SPARQL
endpoints into a federation comes with a number of benefits over a centralized
integration in a single triple store: (i) the data is always up-to-date; (ii) the
computational load is shared among servers (which holds even for local
federations, where different data sets are kept in different triple stores on
local servers); (iii) available endpoints can be integrated into federations ad
hoc, avoiding the often time-consuming process of loading dumps into local
repositories; and (iv) flexibility is increased, since arbitrary combinations of
the data sources can be used and queried in different, requirement-tailored
federations. The latter two are particularly important when local data sources
are combined with public endpoints.

As indicated by our experimental results in this paper, for typical queries
against large federations only a small subset of the endpoints effectively
contributes to the final result. Consequently, splitting up a query into
subqueries and evaluating them in parallel over a local federation can be even
faster than evaluating the full query over a single triple store containing all
the data of all sources (compare, for instance, the experimental results in
[9]). There remains, of course, a tradeoff between the benefit of distributed
and parallel processing and the communication overhead between different
instances, so that some queries would be evaluated more efficiently in a
centralized setup.

Having seen many promising results in previous benchmarks on federated
query processing with only few federation members and in the order of a hundred
million triples [5,9,10], the goal of this paper is to demonstrate the
practicability of large-scale RDF federations: using FedX [10], a highly
optimized Linked Data federation mediator, we set up a federation with 29
SPARQL endpoints hosting the individual data sets from the Bio2RDF domain,
containing more than 4 billion RDF triples in total. In our experimental
results, we study the performance of queries against such federations,
comparing (i) local federations with all SPARQL endpoints running on servers in
a local network and (ii) hybrid federations where some sources are hosted
locally and others on the Web or on Amazon EC2; the latter setup is a classical
setting in the enterprise context, where companies need to combine local,
private data sources with open data accessible through public SPARQL endpoints.
It is beneficial whenever local working copies of some data (e.g., generated
based on information extraction from natural language text, latest experiments,
user-generated/corrected data, downloaded cleaned dumps, etc.) need to be
augmented with public information.
Contributions. In summary, we make the following contributions. 



- We present the first real large-scale federation setup in the context of RDF,
  implemented using the FedX federation mediator on top of 29 Bio2RDF
  SPARQL endpoints containing about 4.1 billion RDF triples.

- Our description summarizes problems and solutions, as well as practical
  aspects of the federation setup. All experiments can be reproduced by anyone
  following the instructions outlined in Section 4.

- We set up a public demonstrator of our Bio2RDF federation, supporting live
  queries against and browsing of the underlying federated data graph.³

- An exhaustive evaluation along different dimensions (including technical
  setup aspects, performance and scalability, and network latency) demonstrates
  the feasibility of federated RDF data management in large-scale settings.

- Our experiments reveal open issues and current limitations, which serve as a
  guideline for future work in the area of federated semantic data processing.

Structure. After a discussion of related work in Section 1.1, we turn towards
a description of the federation technology, the FedX system, in Section 2. Next,
Section 3 describes the federation setup, including a motivation of the chosen
scenario, a description of the datasets, and a general discussion of the
benchmark queries. In Section 4 we describe the infrastructure setup, motivate
the different experimental scenarios and metrics, and present an exhaustive
discussion of the experimental results. Finally, we elaborate on the
implications of our results for future work and conclude with some final
remarks in Section 5.



1.1 Related Work 

With the uptake of Linked Data in recent years, the topic of integrated
querying over multiple distributed data sources has attracted significant
attention. In order to join information provided by these different sources,
efficient query processing strategies are required, the major challenge lying
in the natural distribution of the data. So far, the commonly used approach for
query processing in large-scale integration scenarios is still to integrate
relevant data sets into a local, centralized triple store. Examples of such
integrated repositories are the LOD cloud cache⁴ or Factforge⁵, which integrate
significant subsets of the Linked Open Data cloud. As a more domain-specific
example, Linked Life Data⁶ integrates 23 data sources from the biomedical
domain. Following a similar approach, the OpenPHACTS project⁷ attempts to build
an integrated resource of multiple databases in the pharmaceutical space.

Yet recently one can observe a paradigm shift towards federated approaches
over the distributed data sources with the ultimate goal of virtual integration
[7,8]. A recent overview and analysis of federated data management and query
optimization techniques is presented in [6].

3 See http://biofed.fluidops.net
4 http://lod.openlinksw.com/
5 http://factforge.net/
6 http://linkedlifedata.com/
7 http://www.openphacts.org/
8 http://www.w3.org/TR/sparql11-federated-query/

Basic federation capabilities have been added to SPARQL with the SPARQL
1.1 Federation extension⁸. It introduces the SERVICE operator, which allows
source information to be provided directly within the SPARQL query. Aranda et
al. [1] provide a formal semantics for the language extensions. While our
federation approach in FedX also supports SPARQL 1.1 Federation, it does not
require these extensions. Instead, it is fully compatible with the SPARQL 1.0
query language, i.e. multiple distributed data sources can be queried
transparently as if the data resided in a virtually integrated RDF graph. Source
selection is achieved through automated means over a set of defined sources
(which can be dynamically extended) without explicit specification in the query.
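
To make the difference between the two styles concrete, the following minimal Python sketch builds the same two-pattern query once with explicit SERVICE clauses and once as a plain SPARQL 1.0 query that a mediator could distribute on its own. The endpoint URLs and the helper names (`service_block`, `explicit_federation_query`, `transparent_query`) are assumptions made for illustration; they are not part of FedX or of the benchmark.

```python
# Prefix declarations for the two triple patterns used below.
PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
    "PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>\n"
)

# Hypothetical endpoint URLs, used for illustration only.
DBPEDIA_ENDPOINT = "http://example.org/dbpedia/sparql"
DRUGBANK_ENDPOINT = "http://example.org/drugbank/sparql"


def service_block(endpoint_url, triple_patterns):
    """Wrap triple patterns in a SPARQL 1.1 SERVICE clause for one endpoint."""
    body = "\n".join(f"    {tp} ." for tp in triple_patterns)
    return f"  SERVICE <{endpoint_url}> {{\n{body}\n  }}"


def explicit_federation_query():
    """SPARQL 1.1 style: the query author names the source of each pattern."""
    blocks = [
        service_block(DBPEDIA_ENDPOINT, ["?drug rdf:type dbpedia-owl:Drug"]),
        service_block(DRUGBANK_ENDPOINT, ["?y owl:sameAs ?drug"]),
    ]
    return PREFIXES + "SELECT ?drug ?y WHERE {\n" + "\n".join(blocks) + "\n}"


def transparent_query():
    """SPARQL 1.0 style: no source information; a mediator selects sources."""
    return (
        PREFIXES
        + "SELECT ?drug ?y WHERE {\n"
        + "  ?drug rdf:type dbpedia-owl:Drug .\n"
        + "  ?y owl:sameAs ?drug .\n"
        + "}"
    )


if __name__ == "__main__":
    print(explicit_federation_query())
    print()
    print(transparent_query())
```

In the first form the query text changes whenever a data set moves to another endpoint; in the second form only the list of federation members changes.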

In [9] we introduced the FedBench benchmark suite for testing and analyz- 
ing the performance of federated query processing strategies. Our experiments 
presented in this paper build upon the FedBench benchmark, but evaluate a 
federation scenario of a significantly larger scale. 



2 FedX 

In the following we give some insights into the technologies and concepts of
FedX [10], which is used as the federation mediator in our experimental study.
FedX is a practical framework for transparent access to Linked Data sources
through a federation. By virtually integrating multiple heterogeneous sources,
the federation mediator exposes the union of all source graphs transparently
to the user, i.e. the user can evaluate queries as if the data resided in a single
triple store. Federation members are specified as a list of SPARQL endpoints,
which can be added to (or removed from) the federation on demand, since no
precomputed statistics are required for query processing. With its
federation-tailored optimization techniques discussed below, FedX enables
efficient and scalable SPARQL query processing for different practical
federated settings.

The query processing workflow in FedX is depicted in Figure 1. FedX first
parses the query into an internal tree-like representation, which is then
optimized using various techniques. Optimization in FedX includes source
selection (i.e., finding the relevant sources for each triple pattern using
SPARQL ASK requests), forming exclusive groups (i.e., grouping those triple
patterns that have the same single source), and a rule-based join reordering
approach. At runtime FedX manages a so-called source selection cache, containing
information about which endpoints can potentially yield results for a given
triple pattern. With this cache, FedX is able to reduce the number of requests
since it can directly prune endpoints that are not relevant for the evaluation
of subqueries.
Fig. 1: Federated Query Processing Model of FedX
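
The interplay of ASK-based source selection, the source selection cache, and exclusive groups described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions and not the FedX implementation: endpoints are emulated by small in-memory triple sets with made-up toy data, and all names (`ENDPOINTS`, `ask`, `SourceSelector`, `exclusive_groups`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for SPARQL endpoints (toy data only).
# A real mediator would send ASK queries to remote endpoints over HTTP.
ENDPOINTS = {
    "dbpedia": {("db:Aspirin", "rdf:type", "dbo:Drug")},
    "drugbank": {
        ("drugbank:DB00945", "owl:sameAs", "db:Aspirin"),
        ("drugbank:I1", "drugbank:interactionDrug1", "drugbank:DB00945"),
        ("drugbank:I1", "drugbank:text", "example interaction text"),
    },
    "kegg": {("kegg:D00109", "rdf:type", "kegg:Drug")},
}


def is_var(term):
    return term.startswith("?")


def ask(endpoint, pattern):
    """Emulate SPARQL ASK: does any triple at the endpoint match the pattern?"""
    return any(
        all(is_var(p) or p == t for p, t in zip(pattern, triple))
        for triple in ENDPOINTS[endpoint]
    )


class SourceSelector:
    """Finds relevant endpoints per triple pattern and caches the answers."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.cache = {}          # (pattern key, endpoint) -> bool
        self.ask_requests = 0    # number of ASK requests actually issued

    @staticmethod
    def _key(pattern):
        # Variable names are irrelevant for relevance; only constants matter.
        return tuple(None if is_var(t) else t for t in pattern)

    def relevant_sources(self, pattern):
        key = self._key(pattern)
        sources = []
        for ep in self.endpoints:
            if (key, ep) not in self.cache:
                self.ask_requests += 1
                self.cache[(key, ep)] = ask(ep, pattern)
            if self.cache[(key, ep)]:
                sources.append(ep)
        return sources


def exclusive_groups(selector, patterns):
    """Group patterns whose only relevant source is the same single endpoint."""
    groups = defaultdict(list)
    remaining = []
    for pattern in patterns:
        sources = selector.relevant_sources(pattern)
        if len(sources) == 1:
            groups[sources[0]].append(pattern)
        else:
            remaining.append((pattern, sources))
    return dict(groups), remaining


QUERY = [
    ("?Drug", "rdf:type", "dbo:Drug"),
    ("?y", "owl:sameAs", "?Drug"),
    ("?Int", "drugbank:interactionDrug1", "?y"),
    ("?Int", "drugbank:text", "?IntEffect"),
]

if __name__ == "__main__":
    selector = SourceSelector(ENDPOINTS)
    groups, remaining = exclusive_groups(selector, QUERY)
    print(groups, remaining, selector.ask_requests)
```

Running source selection a second time on the same query issues no further ASK requests, because every (pattern, endpoint) pair is already in the cache. The patterns grouped under one endpoint can then be sent to it as a single subquery.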



As a user-facing frontend built on top of the federation managed by FedX, we
provide a browser-based demo system based on the Information Workbench⁹, a
Linked Data platform in which widgets can be used declaratively within a
semantic wiki to interact with the underlying Linked Data graph.

3 Experiment Scenario: Federating Bio2RDF 

For our experimental study of a large-scale federation we decided to use data
sets from the life science domain. The industries in the life sciences (including
pharmaceuticals and biotechnology) have been early adopters of semantic
technologies, and the value of providing integrated access to distributed data
sources has been demonstrated in many practical applications [3]. Most of the
Linked Data sets in the life sciences have been published as part of the
Bio2RDF initiative, with the goal of providing interlinked life science data to
support biological knowledge discovery. Compared to other domains, the data
sets that have been developed for the life science domain are of rather high
quality and very well interconnected. Consider as an example the Drugbank
dataset, which provides direct links for most drugs to the corresponding KEGG
compounds.

For our federation we have selected 29 data sets, covering (to the best of
our knowledge) all relevant publicly available data sets in the domain. In total,
the selection comprises more than 4 billion triples. Table 1 lists all datasets
and gives the number of triples and entities, as well as the main instance
type(s).



 #   Dataset              #Triples   #Entities   Instance type(s)
 1   CellMap              149k       60k         biopax-2:protein
 2   ChEBI                650k       238k
 3   DailyMed             163k       68k         dailymed:drugs
 4   Disease Ontology     145k       110k
 5   DBpedia Subset       70M        31M         e.g. dbo:Drug
 6   Diseasome            75k        30k         diseasome:genes
 7   DrugBank             0.5M       290k        drugbank:drugs
 8   Entrez-Gene          161.5M     67M         entrezgene:Gene
 9   Genewiki             1.0M       391k
10   KEGG                 2.4M       1M          kegg:Compound, kegg:Drug, kegg:Enzyme, kegg:Reaction
11   Mappings             2.8M       4.1M
12   Pubmed               1.4B       299M        pubmed:Citation
13   UMLS                 121M       27.7M       skos:Concept
14   Uniprot              2.3B       495M        uniprot:Protein, uniprot:Journal
15   BioGRID              12M        4.7M        biopax-2:protein
16   Gene Ontology        320k       187k        skos:Concept
17   HapMap               22M        43M
18   HPRD                 2M         777k        biopax-2:protein
19   Humancyc             327k       143k
20   IMID                 83k        36k         biopax-2:protein
21   IntAct               16.6M      5.5M        biopax-2:protein
22   LHGDN                316k       160k
23   LinkedCT             7.0M       2.8M        linkedct:trials, linkedct:condition
24   MINT                 2.1M       6M          biopax-2:protein
25   NCI-Nature           611k       237k        biopax-2:protein
26   Phenotype Ontology   84k        36k
27   Reactome             815k       330k        biopax-2:protein
28   Sider                102k       30k
29   Symptom              4.2k       2k

Table 1: Life science datasets used for the federation scenario: 29 datasets/4B+ RDF triples



9 http://www.fluidops.com/information-workbench/



Table 2: Summary of query characteristics. Operators: And ("."), Union, Filter,
Group By, Count (#), Optional; Solution modifiers: Distinct, Limit, Offset,
Order By; Structure: Star, Chain, Hybrid. Entries use the initial letters of
these names; "-" marks an empty cell.

      FedBench Life Science (LS)          Linked Life Data (LLD)
 #    Op.   Mod.  Struct.  #Res.     #    Op.   Mod.  Struct.  #Res.
 1    U     -     -        1159      1    A     -     S        167
 2    AU    -     -        319       2    A     D     H        22
 3    A     -     H        9869      3    A     -     C        70
 4    A     -     H        3         4    A     D     C        210
 5    A     -     H        395       5    A     D     C        45
 6    A     -     H        28        6    AF    D     H        63
 7    AFO   -     H        109       7    A     -     H        59
                                     8    A     -     H        131
                                     9    #     -     -        1
                                     10   AG    -     C        2



Example, Life Science Query 4: For all drugs in DBpedia, find all drugs they
interact with, along with an explanation of the interaction.

SELECT ?Drug ?IntDrug ?IntEffect WHERE {
  ?Drug rdf:type dbpedia-owl:Drug .
  ?y owl:sameAs ?Drug .
  ?Int drugbank:interactionDrug1 ?y .
  ?Int drugbank:interactionDrug2 ?IntDrug .
  ?Int drugbank:text ?IntEffect . }

Example, Linked Life Data Query 8: Select all human genes located on the
Y-chromosome with known molecular interactions.

SELECT ?genedescription ?taxonomy ?interaction WHERE {
  ?interaction biopax2:PARTICIPANTS ?p .
  ?interaction biopax2:NAME ?interactionname .
  ?p biopax2:PHYSICAL-ENTITY ?protein .
  ?protein skos:exactMatch ?uniprotaccession .
  ?uniprotaccession core:organism ?taxonomy .
  ?taxonomy core:scientificName 'Homo sapiens' .
  ?geneid gene:uniprotAccession ?uniprotaccession .
  ?geneid gene:description ?genedescription .
  ?geneid gene:chromosome 'Y' . }

Fig. 2: Selected benchmark queries



Queries. We selected two query sets that implement realistic use cases on top
of the life science data collection. The first query set (LS in Table 2) is a
slightly modified version of the Life Science query set from the FedBench
benchmark suite, updated to reflect changes in the schema and data of the
latest versions of the respective data sets. The second query set (LLD in
Table 2) contains sample queries from Linked Life Data (cf.
http://linkedlifedata.com/sparql) and represents typical queries that can be
performed against the integrated set of life science databases. We limited the
selection to those queries that can be answered based on publicly available
data sets (i.e., without data exclusively available through the Linked Life
Data system). Figure 2 shows one sample query from each of the two query sets.

Table 2 gives an overview of the benchmark queries and their properties,
showing that they vary widely in their characteristics. In particular, we
indicate the SPARQL operators that are used inside the query (Op.) and the
solution modifiers that were used additionally (Mod.), categorize the query
structure (Struct.), roughly distinguishing different join combinations (like
subject-subject or subject-object joins) that lead to the query structures
commonly referred to as star-shaped, chain, or hybrid queries, and indicate the
number of results (#Res.) on the federation datasets.
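
The structural categories can be illustrated with a small Python heuristic that inspects the join variables of a basic graph pattern. It is a simplified sketch, not the procedure used to produce Table 2: the function name `classify_structure` and the exact rules (only subject and object positions are considered, predicates are ignored) are assumptions made for this example.

```python
from collections import defaultdict


def is_var(term):
    return term.startswith("?")


def classify_structure(patterns):
    """Classify triple patterns as 'star', 'chain', 'hybrid', or 'none'.

    Heuristic: for every variable shared by two or more triple patterns,
    record whether it occurs in subject and/or object position.
      - only subject-subject joins          -> 'star'
      - only pairwise subject-object joins  -> 'chain'
      - any mixture                         -> 'hybrid'
      - no join variable at all             -> 'none'
    """
    positions = defaultdict(list)  # variable -> list of "s" / "o" occurrences
    for s, _p, o in patterns:
        if is_var(s):
            positions[s].append("s")
        if is_var(o):
            positions[o].append("o")

    join_kinds = set()
    for occurrences in positions.values():
        if len(occurrences) < 2:
            continue  # variable occurs once, so it is not a join variable
        kinds = set(occurrences)
        if kinds == {"s"}:
            join_kinds.add("subject-subject")
        elif kinds == {"s", "o"} and len(occurrences) == 2:
            join_kinds.add("subject-object")
        else:
            join_kinds.add("other")

    if not join_kinds:
        return "none"
    if join_kinds == {"subject-subject"}:
        return "star"
    if join_kinds == {"subject-object"}:
        return "chain"
    return "hybrid"


# Triple patterns of Life Science Query 4 from Figure 2.
LS4 = [
    ("?Drug", "rdf:type", "dbpedia-owl:Drug"),
    ("?y", "owl:sameAs", "?Drug"),
    ("?Int", "drugbank:interactionDrug1", "?y"),
    ("?Int", "drugbank:interactionDrug2", "?IntDrug"),
    ("?Int", "drugbank:text", "?IntEffect"),
]

if __name__ == "__main__":
    print(classify_structure(LS4))
```

For LS4 the variables ?Drug and ?y take part in subject-object joins while ?Int takes part in a subject-subject join, so the heuristic reports a hybrid structure, in line with the entry for LS4 in Table 2.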

A complete description of the data sets (including download links) and queries 
used in the benchmark is available at http://biofed.fluidops.net/. 



4 Experiments 



4.1 Infrastructure Description and Setup 

In our experiments we focus on two different federated settings. First, we set
up a local federation to evaluate the performance and practicability of
federated data processing with FedX in a controlled setting with low network
latency, where all endpoints are deployed in a dedicated local environment.
Second, the hybrid federation consists of a mix of local and remote SPARQL
endpoints (the latter hosted in the Amazon AWS cloud), which allows us to study
the implications arising in scenarios with higher network latency. The hybrid
scenario reflects challenges in the enterprise context, where private,
enterprise-internal data sources are combined with public SPARQL endpoints in a
federated setting. To guarantee repeatability of the experiments we establish a
controlled environment in both settings, i.e. we use SPARQL endpoints running
on non-shared compute and storage resources. The details are described in the
following.

Local federation. For the local federation we provide access to the life
science datasets through individual SPARQL endpoints running in our local
computing cluster. In this cluster we use two HP Proliant DL360 servers running
a 64bit Windows Server operating system, one with 8x2GHz CPU and 64GB
RAM (Server1), the other with 2x3GHz CPU and 20GB RAM (Server2), both
backed by fast storage. The total available memory is distributed to the
individual SPARQL endpoints according to the number of triples, e.g. the
Uniprot endpoint was assigned a total memory of 14GB, while the smaller
Drugbank endpoint is running in a 1.5GB process. The datasets 1 to 14 from
Table 1 are deployed on Server1 and the remaining ones, 15 to 29, are deployed
on Server2.

The individual SPARQL endpoints are powered by a state-of-the-art triple
store implementing the OpenRDF Sesame interface¹⁰, running in Tomcat 6
application server processes. Sesame is the de-facto standard framework for
processing RDF data and offers access to RDF storage solutions through an
easy-to-use API. The triple stores themselves can be accessed via SPARQL
endpoints.

Hybrid federation. For the hybrid setting we deployed selected SPARQL 
endpoints from the local infrastructure to an Amazon AWS EC2 instance. More 
precisely, we deployed the DrugBank, Uniprot, and Pubmed data sets in the 
AWS cloud. As in the local setting, these data sets were deployed as individual
SPARQL endpoints on top of a Tomcat 6 application server, using exactly the 
same database setup and memory assignment for the individual endpoints as in 
the local setting. The endpoints were hosted together on a single, high-memory 
AWS instance (type "m2.2xlarge") running 64bit MS Windows Server 2008 with 
13 EC2 Compute Units (4 virtual cores with 3.25 EC2 compute units each) and 
34.2GB memory. The data sets were attached to the instance using Amazon EBS 
storage volumes. The instance and volumes were both hosted in the AWS zone 
"EU West (Ireland)", allowing for fast communication between compute and 
storage infrastructure. Note that Ireland is the AWS zone closest to Germany, 
where the local endpoints and FedX were run. 



10 http://www.openrdf.org



Mediator and benchmark driver. In both settings, the federation was
driven by the FedX v2.0 federation mediator described in Section 2. FedX was
configured to run over the set of the 29 Bio2RDF SPARQL endpoints, either
using only local endpoints (in the local setting) or the combination of local
and remote endpoints described above (in the hybrid setting). For running the
experiments we used FedBench¹¹ [9], a comprehensive benchmark suite for
analyzing the efficiency and effectiveness of federated query processing
strategies over semantic data that provides customizable benchmark drivers.

Metrics. The central measure in our experiments is the query evaluation
time: in both the local and the hybrid scenario we report the average elapsed
time over five runs, measured after five preceding warmup runs. Following the
guidelines described in [I], we use the geometric mean, which is defined as
the n-th root of the product of n values: compared to the arithmetic mean, the
geometric mean dampens outliers, which in our setting occasionally arose,
particularly in the hybrid federation, due to unpredictable effects such as
temporarily high network delays. Other metrics we discuss are (i) the number of
requests sent to SPARQL endpoints from FedX during query evaluation, (ii) the
number of triple patterns in the individual queries, and (iii) the efficiency of
the source selection algorithm in FedX. As we will discuss in the following,
these are parameters that have significant influence on the query evaluation
times.
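
The effect of the geometric mean on outliers can be seen in a short Python sketch. The timings below are hypothetical values chosen for illustration, not measurements from our experiments; the function names are likewise only illustrative.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logarithms."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def arithmetic_mean(values):
    return sum(values) / len(values)


# Hypothetical evaluation times in seconds for five measured runs; the last
# run is an outlier, e.g. caused by a temporarily high network delay.
runs = [1.0, 1.0, 1.0, 1.0, 16.0]

if __name__ == "__main__":
    print("arithmetic mean:", arithmetic_mean(runs))
    print("geometric mean: ", geometric_mean(runs))
```

With these values the arithmetic mean is 4.0 seconds, whereas the geometric mean is the fifth root of 16 and therefore stays below 2 seconds, much closer to the four regular runs.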



4.2 Experimental Results 

We start with a discussion of FedX' source selection strategy (cf. Section 2),
which forms the basis for understanding the subsequent results. In order
to minimize the number of requests, FedX, prior to evaluating the query,
sends ASK queries for the triple patterns contained in the query to all SPARQL
endpoints to identify which sources are potentially relevant for which patterns
in the query. This information is then used to optimize query processing, for
example by sending patterns only to relevant endpoints or by grouping
subqueries that can be answered by a single endpoint alone. Visualizing the
outcome of the source selection strategy, Figure 3 shows, for each of the
benchmark queries, (i) the number of triple patterns in the query (plotted
below the query name) and (ii) the minimum, maximum, and average (over all
triple patterns in the query) number of endpoints that have been identified as
relevant for the triple patterns according to the source selection strategy. As
an example, query LLD2 is composed of 7 triple patterns, where the minimal
pattern(s) retrieve non-empty results from 3 endpoints, the maximal pattern(s)
retrieve results from 9 endpoints, and the average number of endpoints that
contribute results to a triple pattern in LLD2 is about 7.2. As a whole, the
diagram leads to two interesting observations. First, the queries vary in
complexity regarding the number of sources that (potentially) contribute to the
query result: there are simple queries which can be answered by querying a
single source, while others have triple patterns with potential matches in up
to 9, in the worst case even all 29, federation members.¹² Second, the results
demonstrate that the source selection strategy of FedX is quite effective,
reducing the average number of sources involved in answering triple patterns to
at most 10 out of 29 for all the queries, typically even fewer. Given that the
number of requests is one of the main factors driving evaluation time (as will
be discussed in the following), this source selection strategy can be seen as a
cornerstone for the practicability of FedX.

Fig. 3: Source selection analysis: relationship between queries and triple patterns
w.r.t. relevant endpoints according to FedX' source selection algorithm.

11 FedBench project page: http://code.google.com/p/fbench/

Figure 4(a) compares the query evaluation times for our 17 benchmark queries
over the local and hybrid federation, with source selection caching enabled. Given
the warmup phase prior to taking the measurements, an active source selection
cache implies that FedX in this scenario has full knowledge about which sources
can contribute results to which triple pattern in the input query. Starting with
the discussion of the local federation setting, we can observe that all 17 queries
return a result within 15s. Of these, 15 queries are faster than 3s, 10 queries
are in the sub-second range, and 5 queries are even faster than 0.1s. Given
that our queries represent a mix of dedicated benchmark queries designed
particularly to test challenges in federated scenarios and real-world use cases
from the Bio2RDF project, these numbers demonstrate the practicability of FedX
as a mediator for large-scale RDF federations in the billion triple range.

In addition to a tabular representation of the evaluation time for the two
settings (columns Local and Hybrid), Figure 4(b) summarizes the number of
requests (#Req) sent to the SPARQL endpoints during query evaluation. We
can observe a clear correlation between the number of requests sent during query
evaluation and the query evaluation time; for instance, the five most expensive
queries (in terms of runtime), namely LS3, LS5, LS7, LLD4, and LLD5, are
characterized by the five highest numbers of requests sent to SPARQL endpoints.
This indicates that the network delay is the dominating factor in query
evaluation.

12 Query LS2 contains the triple pattern ?CAFF ?predicate ?object, which, taken
alone, can be answered by all endpoints.



(a) [Bar chart, not reproduced: query evaluation time per query (LS1 to LS7,
LLD1 to LLD10) for the two series Local Federation and Hybrid Federation.]

(b) Evaluation times in seconds and number of requests ("-" marks an empty cell)

Query    Local     Hybrid    #Req   #Req(D)  #Req(U)
LS1      0.021     0.134        1        1        -
LS2      0.016     0.090        1        1        -
LS3      2.356     4.159     1512     1511        -
LS4      0.114     0.085        3        1        -
LS5      2.037     1.678      815        1       91
LS6      0.194     0.148       84        1        -
LS7      2.783     5.451     1355      110        -
LLD1     0.016     0.015        1        -        -
LLD2     1.199     1.282      649        -        -
LLD3     0.170     0.126       75        -        -
LLD4     7.358     7.067     3043       77       78
LLD5    14.823    17.167     6301      806      109
LLD6     0.660     1.142      135       21       19
LLD7     0.514     0.446      162        -      161
LLD8     1.345     1.045      521        9      148
LLD9     0.015     0.016        1        -        -
LLD10    0.122     0.156       75        -        -

(c) Evaluation times in seconds in the hybrid setting and saved ASK requests

Query    No Caching   Caching   #Savings
LS1      0.203        0.134       58
LS2      0.309        0.090       87
LS3      4.274        4.159      145
LS4      0.461        0.085      203
LS5      2.098        1.678      174
LS6      0.462        0.148      145
LS7      5.434        5.451      145
LLD1     0.440        0.015       87
LLD2     2.162        1.282      203
LLD3     0.429        0.126       87
LLD4     7.077        7.067      116
LLD5    16.952       17.167      174
LLD6     1.098        1.142      145
LLD7     0.066        0.446      144
LLD8     1.511        1.045      261
LLD9     0.063        0.016       29
LLD10    0.311        0.156       87

Fig. 4: Experimental Results: (a) Graphical comparison of query evaluation time
in local and hybrid federation; (b) Tabular listing of evaluation times and number
of requests sent to SPARQL endpoints during query evaluation; (c) Influence of
source selection caching in FedX on evaluation times in the hybrid setting.
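
The relationship between the number of requests and the evaluation time in the local setting can be checked directly on the values of Figure 4(b). The following Python sketch computes the Pearson correlation coefficient and compares the five slowest queries with the five queries issuing the most requests; the helper names `pearson` and `top_k` are chosen for this illustration only.

```python
import math

# Values copied from Figure 4(b): query -> (local evaluation time in s, #Req).
RESULTS = {
    "LS1": (0.021, 1), "LS2": (0.016, 1), "LS3": (2.356, 1512),
    "LS4": (0.114, 3), "LS5": (2.037, 815), "LS6": (0.194, 84),
    "LS7": (2.783, 1355), "LLD1": (0.016, 1), "LLD2": (1.199, 649),
    "LLD3": (0.170, 75), "LLD4": (7.358, 3043), "LLD5": (14.823, 6301),
    "LLD6": (0.660, 135), "LLD7": (0.514, 162), "LLD8": (1.345, 521),
    "LLD9": (0.015, 1), "LLD10": (0.122, 75),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


def top_k(index, k=5):
    """Names of the k queries with the largest value in the given column."""
    ranked = sorted(RESULTS, key=lambda q: RESULTS[q][index], reverse=True)
    return set(ranked[:k])


if __name__ == "__main__":
    times = [t for t, _ in RESULTS.values()]
    requests = [r for _, r in RESULTS.values()]
    print("Pearson r:", round(pearson(requests, times), 3))
    print("slowest five == most requests:", top_k(0) == top_k(1))
```

A coefficient close to 1 on these 17 data points supports the observation that request count and evaluation time move together, although it does not by itself establish which factor causes the other.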



In order to study the effect of network latency in more detail, we next compare
the results in the local federation with the hybrid federation. As expected,
query evaluation in the hybrid setting is generally (yet not always) slower due
to the higher network latency induced by the communication with the SPARQL
endpoints in the AWS cloud. Going into more detail, Figure 4(b) also shows
the number of subqueries sent against the remote SPARQL endpoints Drugbank
(#Req(D)) and Uniprot (#Req(U)). Based on these numbers, we can classify
the queries into three classes. The first class contains queries that do not
require communication between FedX and the remote endpoints (LLD1-3 and
LLD9-10). For these queries, FedX' source selection cache helps to avoid
expensive requests to the endpoints in the AWS cloud, so we observe no or only
small overheads in evaluation time. The second class of queries, such as LS3,
LS7, and LLD5, requires a considerable number of requests against the remote
endpoints; as a consequence, the execution time in the hybrid setting
increases. Among these queries, the highest percentage increase of about 100%
can be observed for LS7, which still results in practical response times.
Somewhat surprisingly, we can observe a third class of queries, for which the
hybrid setup even outperforms the local setup (LS4-6, LLD1, LLD3-4, and
LLD7-8). This result can be explained by the fact that the overall load on the
Amazon machine, which hosts only three endpoints rather than 14 to 15 endpoints
as is the case for the two local servers, is lower, which in turn results in
generally faster response times for the subqueries sent to the endpoints. This
shows that in many cases the advantages gained by distribution dominate the
overhead imposed by increased communication costs.
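
The classification above can be retraced from the evaluation times in Figure 4(b). The following Python sketch computes the relative change between the local and the hybrid setting and lists the queries that run faster in the hybrid setup; the function names are chosen for this illustration only.

```python
# Evaluation times in seconds copied from Figure 4(b): query -> (local, hybrid).
TIMES = {
    "LS1": (0.021, 0.134), "LS2": (0.016, 0.090), "LS3": (2.356, 4.159),
    "LS4": (0.114, 0.085), "LS5": (2.037, 1.678), "LS6": (0.194, 0.148),
    "LS7": (2.783, 5.451), "LLD1": (0.016, 0.015), "LLD2": (1.199, 1.282),
    "LLD3": (0.170, 0.126), "LLD4": (7.358, 7.067), "LLD5": (14.823, 17.167),
    "LLD6": (0.660, 1.142), "LLD7": (0.514, 0.446), "LLD8": (1.345, 1.045),
    "LLD9": (0.015, 0.016), "LLD10": (0.122, 0.156),
}


def relative_change(local, hybrid):
    """Change of the hybrid time relative to the local time, in percent."""
    return (hybrid - local) / local * 100.0


def faster_in_hybrid(times=TIMES):
    """Queries whose hybrid evaluation time is below the local one."""
    return {q for q, (local, hybrid) in times.items() if hybrid < local}


if __name__ == "__main__":
    for query in ("LS3", "LS7", "LLD5"):
        print(query, f"{relative_change(*TIMES[query]):+.1f}%")
    print(sorted(faster_in_hybrid()))
```

For LS7 the sketch yields an increase of roughly 96%, which corresponds to the "about 100%" stated above, and the set of queries that are faster in the hybrid setting coincides with the third class.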

Finally, in Figure 4(c), we study the influence of source selection caching in
FedX: the first two columns compare the evaluation times of the queries with
source selection caching disabled vs. enabled in the hybrid setting; the #Savings
column denotes the number of ASK requests (sent to endpoints in order to find
out whether they can contribute answers to a given triple pattern) that could
be saved when caching was turned on. As can be seen, caching leads to runtime
savings in most cases. As a general trend, the relative savings are particularly
high whenever #Savings is high compared to the overall number of requests,
#Req, given in Figure 4(b); for instance, for query LS2, we save 87 requests
(out of, in total, 87+1 = 88 requests to endpoints), which induces a significant
relative speedup. For queries where the number of requests is already high (e.g.,
LS3, LS7, or LLD4), the caching benefits are negligible. In summary, the results
show that the source selection cache is not crucial for efficient evaluation, thus
confirming the flexibility of FedX, which allows new federation members to be
added ad hoc, without warming up caches or precalculating statistics. Still,
source selection caching yields an additional speedup for most queries, which
can be particularly beneficial in scenarios with high query loads involving many
simple queries.
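
The trend can be made explicit with a short calculation on three of the queries. The following Python sketch relates the saved ASK requests to the total number of requests and compares this share with the relative runtime saving; the values are copied from Figure 4(b) and Figure 4(c), while the function names are chosen for this illustration only.

```python
# query -> (time without caching in s, time with caching in s,
#           #Savings from Figure 4(c), #Req from Figure 4(b))
CACHING = {
    "LS2": (0.309, 0.090, 87, 1),
    "LS3": (4.274, 4.159, 145, 1512),
    "LLD4": (7.077, 7.067, 116, 3043),
}


def saved_request_share(savings, requests):
    """Share of all requests (ASK plus evaluation) avoided by the cache."""
    return savings / (savings + requests)


def runtime_saving(no_cache, cached):
    """Relative reduction of the evaluation time when caching is enabled."""
    return (no_cache - cached) / no_cache


if __name__ == "__main__":
    for query, (no_cache, cached, savings, requests) in CACHING.items():
        share = saved_request_share(savings, requests)
        saving = runtime_saving(no_cache, cached)
        print(f"{query}: {share:.1%} of requests saved, {saving:.1%} less runtime")
```

For LS2 the cache removes 87 of 88 requests and the runtime drops by about 70%, whereas for LS3 and LLD4 fewer than 10% of the requests are saved and the runtime changes by less than 3%.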

5 Conclusions and Future Work 

We presented the first large-scale RDF federation in the billion triple range
over Bio2RDF data sources, driven by the highly optimized Linked Data
federation mediator FedX. Our exhaustive and repeatable experimental evaluation
demonstrates the practicability of our approach and studies various aspects
driving evaluation time. One cornerstone of evaluation performance is an
efficient source selection strategy. It is crucial to minimize the number of
requests sent to the individual SPARQL endpoints during query evaluation, which
is the major bottleneck in efficient federated query processing. Going beyond
this finding, our experiments identify settings in which the advantages gained
by distribution dominate the overhead imposed by increased communication costs,
thus leveraging the benefits of a federated setup with autonomous compute
endpoints.

The focus of future work in this area should therefore lie on techniques to
further minimize the communication effort. One promising approach, which
we identified during our interpretation of query evaluation plans, is to exploit
data set specific namespaces in URIs in order to further improve the source
selection process. Another promising approach, aiming at a combination of the
benefits of federation and centralization, would be the automated colocation of
data sets that exhibit frequent joins and therefore impose high communication
costs, which could be achieved, for example, by an adaptive query log analysis
combined with a caching layer maintained inside the federation layer.
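
As a rough illustration of the first idea, the following Python sketch prunes candidate endpoints based on the namespace of a constant subject URI before any ASK request is sent. It is a sketch under strong assumptions and not an existing FedX feature: the namespace index, the endpoint names, the example.org URIs, and the rule that a data set only describes subjects from its own namespace are all hypothetical.

```python
# Hypothetical namespace index: URI prefix -> endpoints assumed to host
# resources from that namespace. All names and URIs are illustrative.
NAMESPACE_INDEX = {
    "http://example.org/drugbank/": {"drugbank"},
    "http://example.org/kegg/": {"kegg"},
}

ALL_ENDPOINTS = ["dbpedia", "drugbank", "kegg", "uniprot"]


def is_var(term):
    return term.startswith("?")


def candidate_sources(pattern, endpoints=ALL_ENDPOINTS, index=NAMESPACE_INDEX):
    """Return the endpoints that still need an ASK request for this pattern.

    If the subject is a constant URI from a known namespace, only the
    endpoints registered for that namespace are kept; otherwise all
    endpoints remain candidates.
    """
    subject = pattern[0]
    if not is_var(subject):
        for namespace, hosts in index.items():
            if subject.startswith(namespace):
                return [ep for ep in endpoints if ep in hosts]
    return list(endpoints)


if __name__ == "__main__":
    print(candidate_sources(("http://example.org/drugbank/DB00945", "?p", "?o")))
    print(candidate_sources(("?s", "?p", "?o")))
```

Under these assumptions a pattern with a constant subject needs one ASK request instead of one per federation member; patterns without a usable constant fall back to the regular source selection.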